Enhancing the prediction of protein coding regions in biological sequence via a deep learning framework with hybrid encoding

Authors

Abstract

Protein coding region prediction is a very important but overlooked subtask for tasks such as the prediction of complete gene structure and of coding/noncoding RNA. Many machine learning methods have been proposed for this problem; they first encode a biological sequence into numerical values and then feed them into a classifier for the final prediction. However, encoding schemes directly influence the classifier's capability to capture coding features, and how to choose a proper encoding scheme remains uncertain. Recently, we proposed a protein coding region prediction method for transcript sequences based on a bidirectional recurrent neural network with non-overlapping 3-mer features, which achieved considerable improvement over existing methods, but there is still much room to improve performance. First, the 3-mer feature, which counts the occurrence frequency of trinucleotides, only reflects local sequence order information between the most contiguous nucleotides and loses almost all global sequence order information. Second, k-mer features of length k larger than three (e.g., hexamers) may also contain useful information. Based on these two points, we here present a deep learning framework with hybrid encoding for protein coding region prediction in biological sequences, which effectively exploits global sequence order information, non-overlapping gapped k-mer (gkm) features, and statistical dependencies among coding labels. 3-fold cross-validation tests on human and mouse sequences demonstrate that our method significantly outperforms existing state-of-the-art methods.

• Coding label dependency can be extended from genomic sequences.
• Hybrid encoding outperforms a single encoding scheme.
• Global sequence order information can be captured by a CNN.
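The two feature families named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the gapped k-mer definition used here (a window of length `l` in which `k` positions are kept and the rest are masked with `-`) is one common convention that the abstract does not confirm. The window length 6 and `k=3` are illustrative choices only.

```python
from collections import Counter
from itertools import combinations


def nonoverlapping_kmers(seq, k=3, frame=0):
    """Split seq into consecutive, non-overlapping k-mers starting at `frame`.

    A trailing fragment shorter than k is dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(frame, len(seq) - k + 1, k)]


def kmer_frequencies(seq, k=3, frame=0):
    """Relative occurrence frequency of each non-overlapping k-mer."""
    kmers = nonoverlapping_kmers(seq, k, frame)
    total = len(kmers)
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in Counter(kmers).items()}


def gapped_kmers(window, k):
    """All gapped k-mers of one window (assumed definition).

    Keeps k of the window's positions and masks the others with '-'.
    """
    patterns = []
    for keep in combinations(range(len(window)), k):
        kept = set(keep)
        patterns.append(
            "".join(c if i in kept else "-" for i, c in enumerate(window))
        )
    return patterns


def nonoverlapping_gkm_counts(seq, l=6, k=3, frame=0):
    """Count gapped k-mers over non-overlapping windows of length l."""
    counts = Counter()
    for window in nonoverlapping_kmers(seq, l, frame):
        counts.update(gapped_kmers(window, k))
    return counts
```

For example, `nonoverlapping_kmers("ATGGCCATG")` splits the sequence into the codon-like units `ATG`, `GCC`, `ATG`. Because these counts discard where each unit occurs in the sequence, they carry only local order information, which is the limitation the abstract raises.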


Similar resources

Network of phonological rules in the Lori dialect of Andimeshk: a study within the framework of the post-generative approach.

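The present study offers a description of the phonological system of the Lori dialect of the city of Andimeshk, located in the northwest of Khuzestan province. The theoretical framework of this study is the post-generative autosegmental model. This thesis includes the following: a description of the sounds of this dialect in terms of traditional phonetics and generative distinctive features, accompanied by a narrow transcription; a description of the phonological system of the Lori dialect and its phonological rules within the framework of the post-generative autosegmental model, and an introduction to the inter...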

Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence.

With the exponential growth of genomic sequences, there is an increasing demand to accurately identify protein coding regions (exons) in genomic sequences. Despite considerable progress in the identification of protein coding regions by computational methods during the last two decades, the performance and efficiency of the prediction methods still need to be improved. In addition, it...


Prediction of Protein Coding Regions on a Programmable VLIW Architecture

Gene annotation is by nature a computationally intensive problem, as it needs to process huge volumes of DNA sequence data. This creates a need for alternative ways of implementing exon prediction algorithms. The paper presents a hardware-based approach in which a Digital Signal Processor is programmed to compute the computationally expensive part of the algorithm. The processor effective...


Molecular sequence accuracy and the analysis of protein coding regions.

Molecular sequences, like all experimental data, have finite error rates. The impact of errors on the information content of molecular sequence data is dependent on the analytic paradigm used to interpret the data. We studied the impact of nucleic acid sequence errors on the ability to align predicted amino acid sequences with the sequences of related proteins. We found that with a simultaneous...



Journal

Journal title: Digital Signal Processing

Year: 2022

ISSN: 1051-2004, 1095-4333

DOI: https://doi.org/10.1016/j.dsp.2022.103430